Summary 


Quantitative modeling of human brain activity can provide 
crucial insights about cortical representations [1, 2] and 
can form the basis for brain decoding devices [3-5]. Recent 
functional magnetic resonance imaging (fMRI) studies have 
modeled brain activity elicited by static visual patterns and 
have reconstructed these patterns from brain activity [6-8]. 
However, blood oxygen level-dependent (BOLD) signals 
measured via fMRI are very slow [9], so it has been difficult 
to model brain activity elicited by dynamic stimuli such as 
natural movies. Here we present a new motion-energy [10, 
11] encoding model that largely overcomes this limitation. 
The model describes fast visual information and slow hemo- 
dynamics by separate components. We recorded BOLD 
signals in occipitotemporal visual cortex of human subjects 
who watched natural movies and fit the model separately 
to individual voxels. Visualization of the fit models reveals 
how early visual areas represent the information in movies. 
To demonstrate the power of our approach, we also con- 
structed a Bayesian decoder [8] by combining estimated 
encoding models with a sampled natural movie prior. The 
decoder provides remarkable reconstructions of the viewed 
movies. These results demonstrate that dynamic brain 
activity measured under naturalistic conditions can be de- 
coded using current fMRI technology. 


Results 


Many of our visual experiences are dynamic: perception, visual 
imagery, dreaming, and hallucinations all change continuously 
over time, and these changes are often the most compelling 
and important aspects of these experiences. Obtaining a 
quantitative understanding of brain activity underlying these 
dynamic processes would advance our understanding of 
visual function. Quantitative models of dynamic mental events 
could also have important applications as tools for psychiatric 
diagnosis and as the foundation of brain-machine interface 
devices [3-5]. 

Modeling dynamic brain activity is a difficult technical 
problem. The best tool currently available for noninvasive 
measurement of brain activity is functional magnetic resonance 
imaging (fMRI), which has relatively high spatial resolution 
[12, 13]. However, blood oxygen level-dependent (BOLD) 
signals measured using fMRI are relatively slow [9], especially 
when compared to the speed of natural vision and many other 
mental processes. It has therefore been assumed that fMRI 
data would not be useful for modeling brain activity evoked 
during natural vision or by other dynamic mental processes. 

Here we present a new motion-energy [10, 11] encoding 
model that largely overcomes this limitation. The model 
separately describes the neural mechanisms mediating visual 
motion information and their coupling to much slower hemo- 
dynamic mechanisms. In this report, we first validate this en- 
coding model by showing that it describes how spatial and 
temporal information are represented in voxels throughout 
visual cortex. We then use a Bayesian approach [8] to combine 
estimated encoding models with a sampled natural movie 
prior, in order to produce reconstructions of natural movies 
from BOLD signals. 
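
The motion-energy computation referred to here [10, 11] can be illustrated with a minimal sketch. It assumes a one-dimensional signal, a cosine/sine filter pair without the Gaussian envelope of a true Gabor filter, log(1 + x) as the compressive nonlinearity, and block averaging for temporal downsampling; these simplifications and all function names are hypothetical and are not taken from the model fit in this study.

```python
import math

def quadrature_energy(signal, cycles):
    """Sum of squared outputs of a cosine/sine (quadrature) filter pair.

    `cycles` is the number of whole filter cycles across the window.
    The Gaussian envelope of a true Gabor filter is omitted for brevity.
    """
    n = len(signal)
    even = sum(s * math.cos(2 * math.pi * cycles * i / n)
               for i, s in enumerate(signal))
    odd = sum(s * math.sin(2 * math.pi * cycles * i / n)
              for i, s in enumerate(signal))
    return even ** 2 + odd ** 2

def compress(energy):
    """Compressive nonlinearity; log(1 + x) is an assumed choice."""
    return math.log1p(energy)

def downsample(values, factor):
    """Average consecutive groups of `factor` samples (15 frames -> 1 s)."""
    return [sum(values[i:i + factor]) / factor
            for i in range(0, len(values) - factor + 1, factor)]

def grating(phase, cycles=2, n=15):
    """One second (15 frames) of a sinusoid with the given phase."""
    return [math.cos(2 * math.pi * cycles * i / n + phase) for i in range(n)]

# The energy of the quadrature pair does not depend on the input phase.
energies = [quadrature_energy(grating(p), cycles=2) for p in (0.0, 1.0, 2.5)]
print([round(e, 3) for e in energies])
```

Squaring and summing the two filter outputs removes the dependence on stimulus phase, which is why the three printed energies agree. The compressed energies would then be downsampled from the 15 Hz frame rate to the 1 Hz sampling rate of the BOLD measurements.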

We recorded BOLD signals from three human subjects while 
they viewed a series of color natural movies (20° x 20° at 15 Hz). 
A fixation task was used to control eye position. Two separate 
data sets were obtained from each subject. The training data 
set consisted of BOLD signals evoked by 7,200 s of color 
natural movies, where each movie was presented just once. 
These data were used to fit a separate encoding model for 
each voxel located in posterior and ventral occipitotemporal 
visual cortex. The test data set consisted of BOLD signals 
evoked by 540 s of color natural movies, where each movie 
was repeated ten times. These data were used to assess the 
accuracy of the encoding model and as the targets for movie 
reconstruction. Because the movies used to train and test 
models were different, this approach provides a fair and objec- 
tive evaluation of the accuracy of the encoding and decoding 
models [2, 14]. 

BOLD signals recorded from each voxel were fit separately 
using a two-stage process. Natural movie stimuli were first 
filtered by a bank of neurally inspired nonlinear units sensitive 
to local motion-energy [10, 11]. L1-regularized linear regres- 
sion [15, 16] was then used to fit a separate hemodynamic 
coupling term to each nonlinear filter (Figure 1; see also Sup- 
plemental Experimental Procedures available online). The 
regularized regression approach used here was optimized to 
obtain good estimates even for computational models con- 
taining thousands of regressors. In this respect, our approach 
differs from the regression procedures used in many other 
fMRI studies [17, 18]. 
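
The second stage can be sketched as follows, under stated assumptions. Each filter output is copied at several time delays so that a separate weight can be fit per filter and delay, and the weights are estimated with an L1 penalty. Cyclic coordinate descent is used here only because it is compact; the solver, the penalty value, the delays, and all names are hypothetical and are not taken from the procedure used in this study.

```python
def delayed_design(features, delays):
    """Stack time-shifted copies of each feature channel.

    features: list of T rows, each a list of F channel values.
    Returns T rows of F * len(delays) values; missing history is zero-filled.
    """
    n_feat = len(features[0])
    rows = []
    for t in range(len(features)):
        row = []
        for d in delays:
            row.extend(features[t - d] if t - d >= 0 else [0.0] * n_feat)
        rows.append(row)
    return rows

def soft_threshold(value, lam):
    """Shrink `value` toward zero by `lam` (the L1 proximal step)."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/2n)||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes j itself.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j])
                      for i in range(n)) / n
            new_w = soft_threshold(rho, lam) / col_sq[j]
            delta = new_w - w[j]
            if delta:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
    return w

# Toy data (hypothetical): one feature channel with an impulse every 7 s,
# and a response equal to the feature delayed by 4 s and scaled by 2.
features = [[1.0] if t % 7 == 0 else [0.0] for t in range(70)]
response = [2.0 * features[t - 4][0] if t >= 4 else 0.0 for t in range(70)]
X = delayed_design(features, delays=[3, 4, 5])
weights = lasso_coordinate_descent(X, response, lam=0.01)
print([round(w, 2) for w in weights])
```

In this toy case the only nonzero weight falls on the 4 s delay, slightly shrunk below 2 by the penalty, while the weights for the other delays are set exactly to zero. That sparsity is what makes an L1 penalty attractive when the number of regressors is in the thousands.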

To determine how much motion information is available in 
BOLD signals, we compared prediction accuracy for three 
different encoding models (Figures 2A-2C): a conventional 
static model that includes no motion information [8, 19], 
a nondirectional motion model that represents local motion 
energy but not direction, and a directional model that repre- 
sents both local motion energy and direction. Each of these 
models was fit separately to every voxel recorded in each 
subject, and the test data were used to assess prediction 
accuracy for each model. Prediction accuracy was defined 
as the correlation between predicted and observed BOLD 
signals. The averaged accuracy across subjects and voxels 
in early visual areas (V1, V2, V3, V3A, and V3B) was 0.24, 
0.39, and 0.40 for the static, nondirectional, and directional 
encoding models, respectively (Figures 2D and 2E; see 
Figure S1A for subject- and area-wise comparisons). This 
difference in prediction accuracy was significant (p < 0.0001, 
Wilcoxon signed-rank test). An earlier study showed that the 
static model tested here recovered much more information 
from BOLD signals than had been obtained with any previous 
model [8, 19]. Nevertheless, both motion models developed 
here provide far more accurate predictions than are obtained 
with the static model. Note that the difference in prediction 
accuracy between the directional and nondirectional motion 
models, though significant, was small (Figure 2E; Figure S1A). 
This suggests that BOLD signals convey spatially localized 
but predominantly nondirectional motion information. These 
results show that the motion-energy encoding model predicts 
BOLD signals evoked by novel natural movies. 
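
Prediction accuracy as defined above is a correlation coefficient computed per voxel. A minimal sketch, assuming the Pearson correlation and hypothetical function and variable names:

```python
import math

def pearson_r(predicted, observed):
    """Pearson correlation between predicted and observed time courses."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0.0 or var_o == 0.0:
        return 0.0  # assumed convention for a constant time course
    return cov / math.sqrt(var_p * var_o)

def mean_accuracy(predictions, observations):
    """Average the per-voxel correlations across voxels."""
    scores = [pearson_r(p, o) for p, o in zip(predictions, observations)]
    return sum(scores) / len(scores)
```

Because the correlation is insensitive to the scale and offset of the prediction, it measures how well the model tracks the time course of the response rather than its absolute amplitude.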

To further explore what information can be recovered from 
these data, we estimated the spatial, spatial frequency, and 
temporal frequency tuning of the directional motion-energy 
encoding model fit to each voxel. The spatial receptive fields 
of individual voxels were spatially localized (Figures 2F and 
2G, left) and were organized retinotopically (Figures 2H and 
2I), as reported in previous fMRI studies [12, 19-23]. Voxel- 
based receptive fields also showed spatial and temporal 
frequency tuning (Figures 2F and 2G, right), as reported in 
previous fMRI studies [24, 25]. 

To determine how motion information is represented in 
human visual cortex, we calculated the optimal speed for 
each voxel by dividing the peak temporal frequency by the 
peak spatial frequency. Projecting the optimal speed of the 
voxels onto a flattened map of the cortical surface (Figure 2J) 
revealed a significant positive correlation between eccentricity 
and optimal speed: relatively more peripheral voxels were 
tuned for relatively higher speeds. This pattern was observed 
in areas V1, V2, and V3 and for all three subjects (p < 0.0001, 
t test for correlation coefficient; see Figure S1B for subject- 
and area-wise comparisons). To our knowledge, this is the first 
evidence that speed selectivity in human early visual areas 
depends on eccentricity, though a consistent trend has been 
reported in human behavioral studies [26-28] and in neuro- 
physiological studies of nonhuman primates [29, 30]. These 
results show that the motion-energy encoding model de- 
scribes tuning for both spatial and temporal information at 
the level of single voxels. 
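
The optimal speed calculation described above is a ratio of two tuning peaks. A minimal sketch, assuming temporal frequency in Hz and spatial frequency in cycles per degree so that the result is in degrees per second; the function name and example values are hypothetical:

```python
def optimal_speed(peak_temporal_hz, peak_spatial_cpd):
    """Optimal speed (deg/s) = peak temporal frequency / peak spatial frequency."""
    if peak_spatial_cpd <= 0:
        raise ValueError("peak spatial frequency must be positive")
    return peak_temporal_hz / peak_spatial_cpd

# Hypothetical voxels: one tuned to fine, slow patterns and one tuned to
# coarse, fast patterns.
print(optimal_speed(2.0, 4.0))  # 0.5
print(optimal_speed(4.0, 0.5))  # 8.0
```

A voxel that prefers low spatial frequencies and high temporal frequencies therefore has a high optimal speed, which is the pattern reported here for more peripheral voxels.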

To further characterize the temporal specificity of the 
estimated motion-energy encoding models, we used the test 
data to estimate movie identification accuracy. Identification 
accuracy [7, 19] measures how well a model can correctly 


Figure 1. Schematic Diagram of the Motion-Energy Encoding Model 

(A) Stimuli pass first through a fixed set of nonlinear 
spatiotemporal motion-energy filters (shown in detail in B) 
and then through a set of hemodynamic response filters fit 
separately to each voxel. The summed output of the filter 
bank provides a prediction of BOLD signals. 

(B) The nonlinear motion-energy filter bank consists of 
several filtering stages. Stimuli are first transformed into 
the Commission Internationale de l'Éclairage L*A*B* color 
space, and the color channels are stripped off. Luminance 
signals then pass through a bank of 6,555 spatiotemporal 
Gabor filters differing in position, orientation, direction, 
spatial, and temporal frequency (see Supplemental 
Experimental Procedures for details). Motion energy is 
calculated by squaring and summing Gabor filters in 
quadrature. Finally, signals pass through a compressive 
nonlinearity and are temporally downsampled to the fMRI 
sampling rate (1 Hz). 


associate an observed BOLD signal pattern with the specific 
stimulus that evoked it. Our motion-energy encoding model 
could identify the specific movie stimulus that evoked an 
observed BOLD signal 95% of the time (464 of 486 volumes) 
within ± one volume (1 s; subject S1; Figures 3A and 3B). 
This is far above what would be expected by chance (<1%). 
Identification accuracy (within ± one volume) was >75% for 
all three subjects even when the set of possible natural movie 
clips included 1,000,000 separate clips chosen at random from 
the internet (Figure 3C). This result demonstrates that the 
motion-energy encoding model is both valid and temporally 
specific. Furthermore, it suggests that the model might 
provide good reconstructions of natural movies from brain 
activity measurements [5]. 
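
The identification rule described above and in Figure 3 can be sketched as follows. The matrix layout follows the Figure 3 legend (rows index predicted volumes, columns index observed volumes); the function names and the toy matrix are hypothetical.

```python
def identify(loglik):
    """For each observed volume (column), pick the predicted volume (row)
    with the highest log-likelihood."""
    n_pred, n_obs = len(loglik), len(loglik[0])
    return [max(range(n_pred), key=lambda n: loglik[n][m])
            for m in range(n_obs)]

def accuracy_within(choices, tolerance=1):
    """Fraction of volumes identified to within `tolerance` volumes."""
    hits = sum(1 for m, n in enumerate(choices) if abs(n - m) <= tolerance)
    return hits / len(choices)

# Toy 4 x 4 log-likelihood matrix (hypothetical values).
toy = [
    [0, -5, -5, -1],
    [-1, -2, -5, -5],
    [-5, 0, 0, -5],
    [-5, -5, -1, -9],
]
choices = identify(toy)
print(choices, accuracy_within(choices))

# Figures quoted in the text for subject S1.
reported = 464 / 486  # identified within one volume
chance = 3 / 486      # three of 486 volumes fall within the tolerance window
print(round(reported, 3), round(chance, 4))
```

With a tolerance of one volume, three candidate volumes count as correct for each observation, which is why chance performance for 486 volumes is 3/486, or about 0.6%.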

We used a Bayesian approach [8] to reconstruct movies 
from the evoked BOLD signals (see also Figure S2). We esti- 
mated the posterior probability by combining a likelihood 
function (given by the estimated motion-energy model; see 
Supplemental Experimental Procedures) and a sampled 
natural movie prior. The sampled natural movie prior consists 
of ~18,000,000 s of natural movies sampled at random from 
the internet. These clips were assigned uniform prior proba- 
bility (and consequently all other clips were assigned zero prior 
probability; note also that none of the clips in the prior were 
used in the experiment). Furthermore, to make decoding 
tractable, reconstructions were based on 1 s clips (15 frames), 
using BOLD signals with a delay of 4 s. In effect, this procedure 
enforces an assumption that the spatiotemporal stimulus that 
elicited each measured BOLD signal must be one of the movie 
clips in the sampled prior. 
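
Because every clip in the sampled prior has the same prior probability, the posterior over clips is proportional to the likelihood. A minimal sketch of this step, assuming an isotropic Gaussian noise model for the likelihood (an assumption made here for illustration; the likelihood actually used is described in Supplemental Experimental Procedures) and hypothetical names throughout:

```python
import math

def log_likelihood(observed, predicted, noise_sd=1.0):
    """Log-likelihood of observed voxel responses given a predicted pattern,
    up to an additive constant, under isotropic Gaussian noise (assumed)."""
    sq_err = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return -sq_err / (2.0 * noise_sd ** 2)

def posterior_over_prior(observed, predicted_by_clip, noise_sd=1.0):
    """Normalised posterior over the sampled clips under a uniform prior."""
    logs = [log_likelihood(observed, p, noise_sd) for p in predicted_by_clip]
    peak = max(logs)  # subtract the peak for numerical stability
    weights = [math.exp(v - peak) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]

def map_index(posterior):
    """Index of the clip with the highest posterior probability."""
    return max(range(len(posterior)), key=posterior.__getitem__)

# Toy example: responses of two voxels and predictions for three prior clips.
observed = [1.0, 0.0]
predicted_by_clip = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
posterior = posterior_over_prior(observed, predicted_by_clip)
print(map_index(posterior), [round(p, 3) for p in posterior])
```

Clips outside the sampled prior never enter the calculation, which is the code-level counterpart of assigning them zero prior probability.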

Figure 4 shows typical reconstructions of natural movies 
obtained using the motion-energy encoding model and the 
Bayesian decoding approach (see Movie S1 for the corre- 
sponding movies). The posterior probability was estimated 
across the entire sampled natural movie prior separately for 
each BOLD signal in the test data. The peak of this posterior 
distribution was the conventional maximum a posteriori 
(MAP) reconstruction [8] for each BOLD signal (see second 
row in Figure 4). When the sampled natural movie prior con- 
tained clips similar to the viewed clip, the MAP reconstructions 
were good (e.g., the close-up of a human speaker shown in 
Figure 4A). However, when the prior contained no clips similar to 
the viewed clip, the reconstructions were poor (e.g., Figure 4B). 
This likely reflects both the limited size of the sampled natural 
movie prior and noise in the fMRI measurements. One way to 
achieve more robust reconstructions without enlarging the 
prior is to interpolate over the sparse samples in the prior. 
We therefore created an averaged high posterior (AHP) re- 
construction by averaging the 100 clips in the sampled natural 
movie prior that had the highest posterior probability (see also 
Figure S2; note that the AHP reconstruction can be viewed as 
a Bayesian version of bagging [31]). The AHP reconstruction 
captures the spatiotemporal structure within a viewed clip 
even when it is completely unique (e.g., the spreading of an 
inkblot from the center of the visual field shown in Figure 4B). 
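
The averaging step behind the AHP reconstruction can be sketched as follows, assuming each clip is stored as a flat list of pixel values and that the average is unweighted; the function name and toy data are hypothetical.

```python
def averaged_high_posterior(clips, posterior, top_k=100):
    """Average, pixel by pixel, the `top_k` clips with the highest posterior.

    clips: list of clips, each a flat list of pixel values of equal length.
    posterior: one posterior probability per clip.
    """
    ranked = sorted(range(len(clips)), key=lambda i: posterior[i],
                    reverse=True)[:top_k]
    n_pix = len(clips[0])
    return [sum(clips[i][p] for i in ranked) / len(ranked)
            for p in range(n_pix)]

# Toy prior of four two-pixel "clips" (hypothetical values).
clips = [[0.0, 0.0], [2.0, 4.0], [4.0, 8.0], [100.0, 100.0]]
posterior = [0.10, 0.40, 0.45, 0.05]
print(averaged_high_posterior(clips, posterior, top_k=2))  # [3.0, 6.0]
print(averaged_high_posterior(clips, posterior, top_k=1))  # [4.0, 8.0]
```

Setting `top_k` to 1 returns the MAP clip itself, so the MAP reconstruction is the special case of AHP with a single clip.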

To quantify reconstruction quality, we calculated the corre- 
lation between the motion-energy content of the original 
movies and their reconstructions (see Supplemental Experi- 
mental Procedures). A correlation of 1.0 indicates perfect 
Figure 2. The Directional Motion-Energy Model Captures 
Motion Information 


(A) Top: the static encoding model includes only Gabor 
filters that are not sensitive to motion. Bottom: prediction 
accuracy of the static model is shown on a flattened map 
of the cortical surface of one subject (S1). Prediction 
accuracy is relatively poor. 

(B) The nondirectional motion-energy encoding model 
includes Gabor filters tuned to a range of temporal 
frequencies, but motion in opponent directions is pooled. 
Prediction accuracy of this model is better than the static 
model. 

(C) The directional motion-energy encoding model in- 
cludes Gabor filters tuned to a range of temporal fre- 
quencies and directions. This model provides the most 
accurate predictions of all models tested. 

(D and E) Voxel-wise comparisons of prediction accuracy 
between the three models. The directional motion-energy 
model performs significantly better than the other two 
models, although the difference between the nondirec- 
tional and directional motion models is small. See also 
Figure S1 for subject- and area-wise comparisons. 

(F) The spatial receptive field of one voxel (left) and its 
spatial and temporal frequency selectivity (right). This 
receptive field is located near the fovea, and it is high- 
pass for spatial frequency and low-pass for temporal 
frequency. This voxel thus prefers static or low-speed 
motion. 

(G) Receptive field for a second voxel. This receptive field 
is located in the lower periphery, and it is band-pass for spatial 
frequency and high-pass for temporal frequency. This 
voxel thus prefers higher-speed motion than the voxel 
in (F). 

(H) Comparison of retinotopic angle maps estimated 
using the motion-energy encoding model (top) and 
conventional multifocal mapping (bottom) on a flattened 
cortical map [47]. The angle maps are similar, even 
though they were estimated using independent data 
sets and methods. 

(I) Comparison of eccentricity maps estimated as in (H). 
The maps are similar except in the far periphery, where 
the multifocal mapping stimulus was coarse. 

(J) Optimal speed projected onto a flattened map as in 
(H). Voxels near the fovea tend to prefer slow-speed 
motion, whereas those in the periphery tend to prefer 
high-speed motion. See also Figure S1B for subject- 
wise comparisons. 
reconstruction of the spatiotemporal energy 
in the original movies, and a correlation of 
0.0 indicates that the movies and their recon- 
struction are spatiotemporally uncorrelated. 
The results for both MAP and AHP reconstruc- 
tions are shown in Figure 4D. In both cases, 
reconstruction accuracy was significantly higher than chance 
(p < 0.0001, Wilcoxon rank-sum test; see Supplemental Exper- 
imental Procedures). Furthermore, AHP reconstructions were 
significantly better than MAP reconstructions (p < 0.0001, 
Wilcoxon signed-rank test). Although still crude (motion- 
energy correlation ~ 0.3), these results validate our general 
approach to reconstruction and demonstrate that the AHP 
estimate improves reconstruction over the MAP estimate. 




Discussion 


In this study, we developed an encoding model that pre- 
dicts BOLD signals in early visual areas with unprecedented 
accuracy. By using this model in a Bayesian framework, we 
Figure 3. Identification Analysis 


(A) Identification accuracy for one subject (S1). The test data in our experiment consisted of 486 volumes (s) of BOLD signals evoked by the test movies. The 
estimated model yielded 486 volumes of BOLD signals predicted for the same movies. The brightness of the point in the mth column and nth row represents 
the log-likelihood (see Supplemental Experimental Procedures) of the BOLD signals evoked at the mth second given the BOLD signal predicted at the nth 
second. The highest log-likelihood in each column is designated by a red circle and thus indicates the choice of the identification algorithm. 

(B) Temporal offset between the correct timing and the timing identified by the algorithm for the same subject shown in (A). The algorithm was correct to 
within ± one volume (s) 95% of the time (464 of 486 volumes); chance performance is <1% (3 of 486 volumes; i.e., three volumes centered at the correct 
timing). 

(C) Scaling of identification accuracy with set size. To understand how identification accuracy scales with size of stimulus set, we enlarged the identification 
stimulus set to include additional stimuli drawn from a natural movie database (which was not actually used in the experiment). For all three subjects, 
identification accuracy (within ± one volume) was >75% even when the set of potential movies included 1,000,000 clips. This is far above chance (gray dashed 
line). 


provide the first reconstructions of natural movies from human 
brain activity. This is a critical step toward the creation of brain 
reading devices that can reconstruct dynamic perceptual 
experiences. Our solution to this problem rests on two key 
innovations. The first is a new motion-energy encoding model 
that is optimized for use with fMRI and that aims to reflect the 
separate contributions of the underlying neuronal population 
and hemodynamic coupling (Figure 1). This encoding model 
recovers fine temporal information from relatively slow BOLD 
signals. The second is a sampled natural movie prior that is 
embedded within a Bayesian decoding framework. This 
approach provides a simple method for reconstructing spatio- 
temporal stimuli from the sparsely sampled and slow BOLD 
signals. 

Our results provide the first evidence that there is a positive 
correlation between eccentricity and optimal speed in human 
early visual areas. This provides a functional explanation for 
previous behavioral studies indicating that speed sensitivity 

Figure 4. Reconstructions of Natural Movies from BOLD Signals 

(A) The first (top) row shows three frames from a natural movie used in the experiment, taken 1 s apart. The second through sixth rows show frames from the 
five clips with the highest posterior probability. The maximum a posteriori (MAP) reconstruction is shown in the second row. The seventh (bottom) row 
shows the averaged high posterior (AHP) reconstruction. The MAP provides a good reconstruction of the second and third frames, whereas the AHP 
provides more robust reconstructions across frames. 

(B and C) Additional examples of reconstructions, in the same format as (A). 

(D) Reconstruction accuracy (correlation in motion-energy; see Supplemental Experimental Procedures) for all three subjects. Error bars indicate ±1 standard 
error of the mean across 1 s clips. Both the MAP and AHP reconstructions are significantly above chance, though the AHP reconstructions are significantly better than 
the MAP reconstructions. Dashed lines show chance performance (p = 0.01). See also Figure S2. 
depends on eccentricity [26-28]. This systematic variation in 
optimal speed across the visual field may be an adaptation 
to the nonuniform distribution of speed signals induced by 
selective foveation in natural scenes [32]. From the perspec- 
tive of decoding, this result suggests that we might further 
optimize reconstruction by including eccentricity-dependent 
speed tuning in the prior. 

We found that a motion-energy model that incorporates 
directional motion signals was only slightly better than a model 
that does not include direction. We believe that this likely 
reflects limitations in the spatial resolution of fMRI recordings. 
Indeed, a recent study reported that hemodynamic signals 
were sufficient to visualize a columnar organization of motion 
direction in macaque area V2 [33]. Future fMRI experiments 
at higher spatial or temporal resolution [34, 35] might therefore 
be able to recover clearer directional signals in human visual 
cortex. 

In preliminary work for this study, we explored several en- 
coding models that incorporated color information explicitly. 
However, we found that color information did not improve 
the accuracy of predictions or identification beyond what 
could be achieved with models that include only luminance 
information. We believe that this reflects the fact that lumi- 
nance and color borders are often correlated in natural scenes 
([36, 37], but see [38]). (Note that when isoluminant, monochro- 
matic stimuli are used, color can be reconstructed from 
evoked BOLD signals [39].) The correlation between luminance 
and color information in natural scenes has an interesting side 
effect: our reconstructions tended to recover color borders 
(e.g., borders between hair versus face or face versus body), 
even though the encoding model makes no use of color infor- 
mation. This is a positive aspect of the sampled natural movie 
prior and provides additional cues to aid in recognition of 
reconstructed scenes (see also [40]). 

We found that the quality of reconstruction could be 
improved by simply averaging around the maximum of the 
posterior movies. This suggests that reconstructions might 
be further improved if the number of samples in the prior is 
much larger than the one used here. Likelihood estimation 
(and thus reconstruction) would also improve if additional 
knowledge about the neural representation of movies was 
used to construct better encoding models (e.g., [41]). 

In a landmark study, Thirion et al. [6] first reconstructed 
static imaginary patterns from BOLD signals in early visual 
areas. Other studies have decoded subjective mental states, 
such as the contents of visual working memory [42], or whether 
subjects are attending to one or another orientation or direc- 
tion [3, 43]. The modeling framework presented here provides 
the first reconstructions of dynamic perceptual experiences 
from BOLD signals. Therefore, this modeling framework might 
also permit reconstruction of dynamic mental content such as 
continuous natural visual imagery. In contrast to earlier studies 
that reconstruct visual patterns defined by checkerboard 
contrast [6, 7], our framework could potentially be used to 
decode involuntary subjective mental states (e.g., dreaming 
or hallucination), though it would be difficult to determine 
whether the decoded content was accurate. One recent study 
showed that BOLD signals elicited by visual imagery are more 
prominent in ventral-temporal visual areas than in early visual 
areas [44]. This finding suggests that a hybrid encoding model 
that combines the structural motion-energy model developed 
here with a semantic model of the form developed in previous 
studies [8, 45, 46] could provide even better reconstructions of 
subjective mental experiences. 


Experimental Procedures 


Stimuli 

Visual stimuli consisted of color natural movies drawn from the Apple Quick- 
Time HD gallery (http://trailers.apple.com/) and YouTube (http://www. 
youtube.com/; see the list of movies in Supplemental Experimental Proce- 
dures). The original high-definition movies were cropped to a square 
and then spatially downsampled to 512 x 512 pixels. Movies were then 
clipped to 10-20 s in length, and the stimulus sequence was created by 
randomly drawing movies from the entire set. Movies were displayed using 
a VisuaStim LCD goggle system (20° x 20° at 15 Hz). A colored fixation spot 
(4 pixels or 0.16° square) was presented on top of the movie. The color of the 
fixation spot changed three times per second to ensure that it was visible 
regardless of the color of the movie. 
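
The stimulus dimensions quoted above can be cross-checked with a short calculation. The sketch assumes that the 512-pixel image spans the full 20° field and that pixels map linearly to visual angle; the function name is hypothetical.

```python
def pixels_to_degrees(pixels, image_pixels=512, field_degrees=20.0):
    """Convert a size in pixels to degrees of visual angle, assuming the
    image spans the whole field and the mapping is linear."""
    return pixels * field_degrees / image_pixels

# Size of the 4-pixel fixation spot in degrees.
print(round(pixels_to_degrees(4), 2))  # 0.16
```

Under these assumptions each pixel subtends about 0.04°, so the 4-pixel fixation spot corresponds to the stated 0.16°.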


MRI Parameters 

The experimental protocol was approved by the Committee for the Protec- 
tion of Human Subjects at University of California, Berkeley. Functional 
scans were conducted using a 4 Tesla Varian INOVA scanner (Varian, Inc.) 
with a quadrature transmit/receive surface coil (Midwest RF). Scans were 
obtained using T2*-weighted gradient-echo EPI: TR = 1 s, TE = 28 ms, flip 
angle = 56°, voxel size = 2.0 x 2.0 x 2.5 mm³, FOV = 128 x 128 mm². The 
slice prescription consisted of 18 coronal slices beginning at the posterior 
pole and covering the posterior portion of occipital cortex. 


Data Collection 
Functional MRI scans were made from three human subjects, S1 (author 
S.N., age 30), S2 (author T.N., age 34), and S3 (author A.T.V., age 23). All 
subjects were healthy and had normal or corrected-to-normal vision. The 
training data were collected in 12 separate 10 min blocks (7,200 s total). 
The training movies were shown only once each. The test data were 
collected in nine separate 10 min blocks (5,400 s total) consisting of 9 min 
movies repeated ten times each. To minimize effects from potential adapta- 
tion and long-term drift in the test data, we divided the 9 min movies into 
1 min chunks, and these were randomly permuted across blocks. Each 
test block was thus constructed by concatenating ten separate 1 min 
movies. All data were collected across multiple sessions for each subject, 
and each session contained multiple training and test blocks. The training 
and test data sets used different movies. 
Additional methods can be found in Supplemental Experimental 
Procedures. 


Supplemental Information 


Supplemental Information includes two figures, Supplemental Experimental 
Procedures, and one movie and can be found with this article online at 
doi:10.1016/j.cub.2011.08.031. 